fix: add correct support for compressing file-like objects #174
Conversation
ibm_cloud_sdk_core/base_service.py (Outdated)

    # Handle the compression for file-like objects.
    # We need to use a custom stream/pipe method to prevent
    # reading the whole file into the memory.
    request['data'] = GzipStream(raw) if isinstance(raw, io.IOBase) else gzip.compress(raw)
I used this if-else format to avoid the too-many-branches linter error, then I renamed the variable to raw so that the line doesn't exceed the max row length... :)
I think it would add some value if you were to add some additional logic to the GzipStream() ctor so that you could also pass in a non-IOBase-type value for raw and it would do the right thing (in that case, perhaps just wrap raw with an appropriate IOBase-type class that could stream the bytes of raw... in other words, beef up GzipStream just a bit so it can handle both file-like objects and static strings/buffers).
This way, the details of how a particular requestBody is gzip-encoded are hidden inside GzipStream and not exposed up in this BaseService method, AND we also get stream-based gzip compression being performed, which might help with large JSON requestBodies.
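A minimal sketch of that suggestion (the helper name is hypothetical; the ctor in this PR may look different): normalize the input once, so the rest of GzipStream only ever deals with a file-like object.

```python
import io

def to_file_like(raw):
    """Hypothetical helper: accept a file-like object, bytes, or str
    and always hand back something with a read() method."""
    if isinstance(raw, io.IOBase):
        return raw                  # already streamable, pass through
    if isinstance(raw, str):
        raw = raw.encode('utf-8')   # encode static strings first
    return io.BytesIO(raw)          # wrap static buffers in a stream

# The GzipStream ctor could then simply start with:
#   self.uncompressed = to_file_like(raw)
```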
FYI, when I was originally investigating this in our SDK I noticed this in the gzip.compress() docs:

> Changed in version 3.11: Speed is improved by compressing all data at once instead of in a streamed fashion.

So there might not be much value in trying to be fully stream-based.
It also says:

> Calls with mtime set to 0 are delegated to zlib.compress() for better speed.

It isn't immediately obvious to me whether that delegation makes any difference to the streaming behaviour, but mtime defaults to the current time, so it won't happen at present anyway (nb. the mtime parameter is only available from 3.8).
> So there might not be much value in trying to be fully stream-based.

In my opinion it would make the code cleaner (to have all compression-related steps in the helper class), and although the performance is improved in 3.11, the memory usage could still be a problem when the file is large, right? I think that's Phil's main point.

> mtime only available from 3.8

We still support 3.7, so that's not an option - at least for the gzip.compress function. From my understanding, mtime is only used during decompression, so - I think - the streaming shouldn't affect the result.
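A quick check of that claim with the plain gzip module (the mtime keyword needs Python 3.8+): mtime only lands in the gzip header, so decompression recovers identical bytes either way.

```python
import gzip

data = b'{"example": "payload"}'

default_mtime = gzip.compress(data)        # header records the current time
zero_mtime = gzip.compress(data, mtime=0)  # fixed mtime; Python 3.8+ only

# The gzip headers differ, but the decompressed payload is identical,
# so the mtime choice cannot change what the server ultimately receives.
assert gzip.decompress(default_mtime) == data
assert gzip.decompress(zero_mtime) == data
```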
> the memory usage could still be a problem when the file is large, right? I think that's Phil's main point

My understanding is that in 3.11 the entire contents will be in memory inside gzip anyway. So my point was that it might not be worth making changes only for that reason, but as you say there are other benefits.
    @@ -647,6 +647,34 @@ def test_gzip_compression():
        assert prepped['headers'].get('content-encoding') == 'gzip'


    def test_gzip_compression_file_input():
Had to put the new test cases into a separate function to avoid the too-many-branches linter error.
I have updated the PR based on @padamstx's suggestion. Let me know what you think, I am open to changing it if you have major concerns or better ideas!
LGTM
If possible, it probably wouldn't hurt to try to test these changes with a much larger payload, perhaps by using some examples or integration tests from the platform-services python SDK or perhaps cloudant.
@ricellis do you by chance have a scenario that might be a good test prior to merging?
I'll try and run it through our suites tomorrow.

No good I'm afraid: 167 failures. I didn't check each one individually, but they seem to be of two types.

and secondly:

FWIW the first is also what I saw testing locally when I tried to validate that this change solved the issue reported in IBM/cloudant-python-sdk#554.

Thanks a lot @ricellis, I will take a look! At first glance these issues should be reproducible in unit tests, but we'll see.
@ricellis Could you do another test run when you have some time? I've fixed the issues I found and tweaked the generated unit tests to handle

Re-tested with the new generated test fixes, all passing now. Thanks!

@ricellis Thanks for the testing (and finding the bugs)!
LGTM!
## [3.17.1](v3.17.0...v3.17.1) (2023-10-04)

### Bug Fixes

* add correct support for compressing file-like objects ([#174](#174)) ([2f91105](2f91105))

🎉 This PR is included in version 3.17.1 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
This PR contains the fix for the bug in the request preparation step. Previously, file-like objects couldn't be compressed and the code threw an exception if the user tried to do it. The solution in this PR is a helper class called GzipStream, which acts as an intermediary between the reader and the writer and compresses the data on the fly. More details can be found in the comments.
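As an illustration only (not the PR's actual implementation), a streaming gzip pipe along these lines can compress chunks on demand as the HTTP client reads the body, so the whole file never has to sit in memory at once:

```python
import gzip
import io

class GzipStreamSketch(io.RawIOBase):
    """Illustrative sketch: lazily gzip a file-like source so each
    read() pulls one uncompressed chunk and returns compressed bytes."""

    CHUNK = 8192

    def __init__(self, source):
        self.buffer = b''
        self.source = source
        # GzipFile writes its compressed output back into *this*
        # object via the write() method below.
        self.gzip = gzip.GzipFile(fileobj=self, mode='wb')

    def writable(self):
        return True

    def write(self, data):
        # Called by GzipFile with compressed chunks; stash for read().
        self.buffer += data
        return len(data)

    def read(self, size=-1):
        # Compress just enough of the source to satisfy this read.
        while size < 0 or len(self.buffer) < size:
            chunk = self.source.read(self.CHUNK)
            if not chunk:
                self.gzip.close()  # flush the trailer; safe to call twice
                break
            self.gzip.write(chunk)
        if size < 0:
            size = len(self.buffer)
        result, self.buffer = self.buffer[:size], self.buffer[size:]
        return result
```

Because the result is itself a file-like object, an HTTP client that accepts a file-like request body can read the compressed data incrementally.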